Abstract


The relationship between protein mutations and con- 
formational change can potentially decipher the language 
relating sequence to structure. Elsewhere, we presented 
the Protein Mutant Resource (PMR), an online tool that 
systematically identified related mutants in the Protein 
Data Bank (PDB), inferred mutant Gene Ontology classifications
using data mining, and allowed intuitive exploration
of relationships between mutant structures. Here,
we perform a comprehensive statistical analysis of PMR 
mutants. Although the PMR contains spectacular confor- 
mational changes, generally there is a counter-intuitive 
inverse relationship between conformational change and 
the number of mutations. That is, mutations in the PDB contrast
with naturally evolved mutations. We compare the frequencies
of mutations in the PMR/PDB datasets against the 
PAM250 natural mutation frequencies to confirm this. We 
make available morph movies from PMR structure pairs, 
allowing visual analysis of conformational change and 
the ability to distinguish visually between conformational 
change due to motions (e.g., ligand binding) and muta- 
tions. The PMR is at http://pmr.sdsc.edu. 


1. Introduction 


Rational drug design seeks to use knowledge of a 
protein’s three-dimensional chemical structure as a target 
against which to design new drugs. This particular application
of X-ray crystallography has likely been one of the
principal economic factors driving growth in the experimental
field. However, mutant proteins may occur naturally
in the target host population or, in the case of drugs
such as antibiotics, may evolve in the parasite as a form 
of drug resistance. Scientists would like to have an under- 
standing of how a putative drug interacts not only with 
the wild-type protein, but also with its likely mutants. The 
cost of experimental determination of mutant structures is 
often still prohibitive. An extensive structural database of 
proteins and neighboring mutants can be expected to as- 
sist scientists in visualizing likely structural changes 
brought about by mutation and would likely find immedi- 
ate application in rational drug design through improved 
homology modeling, which is of importance in the phar- 
maceutical industry [1-4]. 

The deduction of a detailed, three-dimensional 
chemical structure of a protein from its genetic sequence 
is a fundamental and long-studied problem in structural 
biology [5-7], of importance in de novo structure predic- 
tion, protein folding, crystallographic refinement phasing, 
molecular dynamics, and computational chemistry. At
present, it is solved principally through the labor-intensive
but effective processes of X-ray crystallography
and NMR. A direct study of existing data on the effects of
protein mutation on protein structure can have immediate 
payoffs [7, 8]. 

Although databases of mutant gene products [9-12] 
as well as specialized databases of mutant protein struc- 
tures have previously been developed [13-15], the PMR 
[16] was the first PDB-wide [17] database of mutant pro- 
tein structures. Entering the PDB ID of a structure in the 
PMR into the entry form on the PMR home page brought 
up the sequence of the wild-type structure for that mutant 
family along with a listing of the differences in amino 
acid sequence between the wild-type and the selected 
PDB ID (Figure 1). Users could click on any of the mutation
sites listed for the wild-type structure to obtain a listing
of the available mutant structures with modifications
at the amino acid position (Figure 1).

Figure 1. Web screen shot of the PMR mutant browser. Upon selecting a specific mutational site
from the sequence all mutations at that site are shown and the site is shown on the 3-D structure
(requires Chime). A torsion angle analysis and a morph movie analysis displaying the conformational
change between the selected PDB ID and the designated wild-type for the particular mutant
family can also be displayed.

An anticipated use
of the PMR was in protein engineering. Scientists seeking 
to modify the sequence of an existing protein to express a 
slightly different structure could search the PMR for se- 
quences matching the protein’s current sequence and then 
examine the stored mutations for structure variants [18, 
19]. The PMR website had a number of other innovations;
in particular, the PMR GO classification feature
utilized an improved, database-wide, and statistically
rigorous means of gene annotation and data mining with
widespread applicability [20, 21]. PMR database entries
interacted with a number of external databases (MolMovDB
[22-25], GO [26], PubMed/Entrez [27], PDBsum
[28]) as well as the PDB. Consequently, the PMR could
be used as a portal by those studying families of proteins 
of closely related sequence within the PDB. 

Here, we characterized the effect of PMR mutations 
on protein tertiary structure statistically and detected a 
potential selective bias in available PMR/PDB structures 
of mutant proteins. To confirm this, we compared the 
frequency of mutations in PMR/PDB datasets against 
accepted PAM 250 natural amino acid mutation frequen- 
cies. We further improved on the PMR web interface by 
generating and making morph movies of the conforma- 
tion changes available on the web. These automatically 
generated morph movies assist scientists in visually distinguishing
conformational changes caused by
protein motions [25] from those caused primarily by
changes in protein sequence [16]. We believe both our
statistical analysis of PMR/PDB data and our morph
movies of PMR data will be of general interest to the
structural bioinformatics community. Our morph movies
are freely available from the PMR website
(http://pmr.sdsc.edu).


Table 1. Mutation statistics for the PMR
database taken from 1157 PDB structures.

Total number of mutations (chains) | 3343
Wild-type PDB chains | 194
Non wild-type PDB chains | 3149
Number of PDB IDs associated with mutations |
Mutation sites | 1157
Average number of residues mutated per chain | 2.9
Most commonly mutated amino acid in PMR | Alanine


2. Materials and Methods 


Figure 2. Distribution of the number of mutations per polypeptide chain in the PMR. The specific
PDB and chain identifiers for each polypeptide chain with > 10 mutations are shown.

As described elsewhere [16], the PMR was generated
by automatically clustering the PDB [17] at 95% sequence
identity using the CD-HIT sequence clustering
approach [29]. CD-HIT uses a greedy algorithm [30] to
sort and process sequences in order of decreasing length;
the longest sequence in each cluster becomes its representative.
Efficiency is achieved because sequences are processed
by comparing them only against the representative
sequences for each established cluster to decide whether
they should be added to an existing cluster or become the
representative for a new cluster.
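
The greedy, representative-based procedure described above can be sketched as follows. This is only an illustration under stated assumptions, not the CD-HIT implementation: the `identity` function is a crude stand-in built on Python's `difflib` (CD-HIT itself uses short-word filtering and alignment), and the names `identity` and `greedy_cluster` are hypothetical.

```python
from difflib import SequenceMatcher


def identity(a: str, b: str) -> float:
    """Crude stand-in for sequence identity (an assumption, not CD-HIT's measure):
    matched characters divided by the length of the shorter sequence."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / min(len(a), len(b))


def greedy_cluster(sequences, threshold=0.95):
    """Greedy incremental clustering: process sequences longest first and
    compare each one only against existing cluster representatives."""
    clusters = []  # each cluster: {"representative": str, "members": [str, ...]}
    for seq in sorted(sequences, key=len, reverse=True):
        for cluster in clusters:
            if identity(cluster["representative"], seq) >= threshold:
                cluster["members"].append(seq)  # join the first matching cluster
                break
        else:
            # No representative was similar enough: seq founds a new cluster
            clusters.append({"representative": seq, "members": [seq]})
    return clusters


if __name__ == "__main__":
    toy = ["W" * 30, "A" * 40, "A" * 39 + "G"]  # invented toy sequences
    for c in greedy_cluster(toy):
        print(len(c["representative"]), len(c["members"]))
```

Because each incoming sequence is compared only with cluster representatives, the number of comparisons grows with the number of clusters rather than with the number of sequences already processed.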

The resulting clusters were then manually filtered
into species-based families and a ‘wild-type’ PDB chain
was manually selected from each family by inspection of
the scientific literature [16]. Software was developed to
automatically find and add new PDB entries to existing
PMR families on a regular basis. The resulting data were
loaded into Oracle tables and made freely accessible via a
web interface, developed in Perl [31], Oracle SQL,
Chime, JavaScript, and HTML.

Morph movie technology [22, 23] originally developed
for the Database of Macromolecular Motions [22,
24, 25, 32, 33] (http://molmovdb.org) was applied to
PMR data to make four-dimensional visual illustrations
[22, 23, 34] of the conformation changes induced by mutation.
Structural change is given in terms of Cα displacement
for each possible, non-redundant wild-type and
mutant structure pair in each PMR family using a sieve-fitting
superposition technique [35] which interpolates
between solved structures to provide a visual rendering.
Protein motions are available as animated GIF images on
the PMR website (http://pmr.sdsc.edu; Figure 1). Summary
statistics on the types of mutation and the motions
induced are given in Table 1. The number of mutations
per structure is given in Figure 2. Cα displacements by
mutant residue type in PMR data are given in Figure 3.
The frequencies of amino acid mutations in PMR data were
tabulated and compared with the accepted PAM250 natural
amino acid mutation frequencies [36] and are shown
in Table 2 and Figure 4. A scatter-plot depicting the relationship
of number of mutations with structural change
(as measured by previously computed Cα displacements)
is shown in Figure 5.

Figure 3. Mean Cα displacement by mutant residue type for polypeptide chains with single mutations.
(Horizontal axis: residue mutated; vertical axis: mean Cα displacement in ångströms.)

Table 2. PAM250 mutation frequencies compared with PMR mutation frequencies. The rightmost value
in each cell gives the mutational frequency in the PMR between any two amino acids. This was
computed by tabulating the occurrences of that particular mutation between wild-type and mutant
chains throughout the PMR, and then linearly normalizing that fractional value to the same scale as
PAM250. The leftmost value in each cell gives the raw PAM250 accepted natural amino acid mutation
rate [36]. Lower (more negative) numbers indicate an observed reduced tendency to mutate. PMR
values along the diagonal are undefined, so only PAM250 values are given. Values in which the
difference between the PAM250 accepted mutation rate and the PMR rate exceeds half of the
range are shaded; about 5% of the values are designated this way.

Figure 4. Bar chart of mutational frequencies of amino acids in the PMR and the accepted natural
PAM250 amino acid mutation rate [36]. Mutational frequencies for PMR and PAM250 datasets were
summed vertically and horizontally (Table 2) for each amino acid to generate a mutational frequency
for each amino acid. Mutational frequencies were then sorted by the PAM250 values (foreground).
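
The linear normalization mentioned in the Table 2 caption is not specified in detail, so the following is only a sketch of one plausible reading: substitution counts are turned into fractions and then min-max rescaled onto a target range. The function names, the treatment of a substitution pair as unordered, and the example range of -8 to 17 are all assumptions made for illustration.

```python
from collections import Counter


def pmr_frequencies(mutations):
    """Fraction of observed substitutions per residue pair.

    mutations: iterable of (wild_type_residue, mutant_residue) tuples.
    Pairs are treated as unordered (an assumption); identical residues are skipped.
    """
    counts = Counter(tuple(sorted(pair)) for pair in mutations if pair[0] != pair[1])
    total = sum(counts.values())
    if total == 0:
        return {}
    return {pair: n / total for pair, n in counts.items()}


def rescale(freqs, lo, hi):
    """Linearly map the fractions onto [lo, hi] (min-max scaling, assumed)."""
    if not freqs:
        return {}
    fmin, fmax = min(freqs.values()), max(freqs.values())
    span = fmax - fmin
    if span == 0:
        # All pairs equally frequent: place them at the midpoint of the range
        return {pair: (lo + hi) / 2 for pair in freqs}
    return {pair: lo + (f - fmin) * (hi - lo) / span for pair, f in freqs.items()}


if __name__ == "__main__":
    # Invented observations, for illustration only
    observed = [("G", "A")] * 3 + [("A", "C")] * 2 + [("F", "Y")]
    scaled = rescale(pmr_frequencies(observed), lo=-8, hi=17)  # illustrative range
    for pair, value in sorted(scaled.items()):
        print(pair, round(value, 2))
```

Once both datasets share a scale, the cell-by-cell difference between the PAM250 value and the rescaled PMR value can be compared against half of the range, which is the shading criterion the caption describes.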


3. Results 


A morph analysis of all possible structure pair combinations
within PMR families would have O(N²) complexity
in the size of the family, in terms of both disk space
and CPU time, and was not computationally tractable.
Instead, we limited our analysis to non-redundant pairings
of each family member with the wild-type of its family.
This has O(N) complexity and reduced the task by several
orders of magnitude. Running in parallel, our morph
server software required three days of CPU time on a 
sixteen-CPU cluster of four-CPU Sun Ultra-80 servers 
running SunOS 5.7. The resulting morph movies and sta- 
tistical data require 4.3 gigabytes of disk storage. 
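
The difference between the two pairing strategies can be made concrete with a small count. The sketch below is illustrative only; the function names and the family size of 500 are hypothetical and are not taken from the PMR.

```python
from itertools import combinations


def all_pairs(n: int) -> int:
    """Number of unordered structure pairs within a family of n chains: O(N^2)."""
    return n * (n - 1) // 2


def wild_type_pairs(n: int) -> int:
    """Number of pairs when each non-wild-type chain is paired only with
    the family's wild-type: O(N)."""
    return max(n - 1, 0)


if __name__ == "__main__":
    n = 500  # hypothetical family size
    print(all_pairs(n), wild_type_pairs(n))
    # Cross-check the closed form against explicit enumeration for a small family
    assert all_pairs(6) == len(list(combinations(range(6), 2)))
```

For a family of this hypothetical size, restricting the analysis to wild-type pairings reduces the number of morphs from 124750 to 499.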

Figure 5. Number of mutations per polypeptide chain versus structural change (Cα
displacement).

The distribution of numbers of mutations per polypeptide
chain is given in Figure 2. More than 99% of
the PMR mutant structures have 9 or fewer mutations
when compared to the wild-type. The rapid fall-off in
number of mutations is expected as these structures are
usually studied to understand the impact of a single mutation or a
small number of correlated mutations. Alanine is the residue
most commonly used to mutate structures (Table 1),
presumably to change the functional role of a given residue
while, through the presence of a Cβ carbon, still conferring
side chain directionality and some sense of side chain
volume in a neutral substitution. Single mutations involving
glycine result in the largest structural changes (Figure
3), presumably as a result of significant stereochemical
change in side chain volume and possibly in physicochemical
properties.
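
A per-residue summary of the kind plotted in Figure 3 could be computed along the following lines. This is a minimal sketch under assumptions: the record layout, the function name, and the numbers in the example are invented for illustration, and the Cα displacements are taken as already computed.

```python
from collections import defaultdict
from statistics import mean


def mean_displacement_by_residue(records):
    """Average Cα displacement per mutated residue, single-mutation chains only.

    records: iterable of (n_mutations, residue, ca_displacement) tuples,
    a hypothetical layout chosen for this sketch.
    """
    groups = defaultdict(list)
    for n_mutations, residue, displacement in records:
        if n_mutations == 1:  # keep only chains with a single mutation
            groups[residue].append(displacement)
    return {residue: mean(values) for residue, values in groups.items()}


if __name__ == "__main__":
    # Invented values in ångströms, for illustration only
    example = [(1, "G", 4.0), (1, "G", 2.0), (1, "A", 0.5), (3, "A", 9.0)]
    print(mean_displacement_by_residue(example))
```

Chains with more than one mutation are excluded so that each displacement can be attributed to one residue type.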

There is a significant difference in the mutational 
frequencies for PMR versus the natural PAM250 fre- 
quencies (Figure 4 and Table 2). For example, the table 
shows that the mutation of phenylalanine to tyrosine 
seems disfavored by structural biologists despite a rela- 
tively high frequency during the course of evolution. 
Conversely, the conversion of alanine to cysteine, un- 
common during the course of evolution, is favored by 
structural biologists, presumably to add stability to a 
structure under study. Divergence from natural frequen- 
cies of mutation is not surprising since structural biology 
is often concerned with the engineering of proteins to test 
function. Not surprisingly, loss of function is less common
in nature than it is in the laboratory. It is not difficult
to engineer a protein that retains its structure but
is biologically inactive or has significantly reduced
activity.

Given the complete body of structural data showing
mutation from the wild-type, and contrary to what one would
expect, structural change shows an inverse relationship to
the number of mutations (Figure 5): large structural
changes are induced by small numbers of mutations,
whereas a large number of mutations can lead to relatively
small changes. Consider several specific cases
shown in the PMR data (Figure 5). The largest structural
change in the PMR occurs in TAQ DNA polymerase
[37]. A number of structures of TAQ DNA polymerase
have been solved with an alanine instead of the wild-type
glycine at position 152. These all show an unusually large
140 Å Cα displacement. Crystal contacts may be ruled out
as a cause, since the mutant structure has been solved in
a number of different space groups, including the same
space group as the wild-type, and all structures give an
identical Cα displacement. The 140 Å Cα displacement
structures are all bound to DNA, whereas the 1TAQ wild-type
[37] is an unbound structure, indicating that the unusually
large conformation change here is likely due to
the combined effect of a DNA clamping protein motion
and the single-point mutation. Thus the PMR does
not distinguish between conformational changes induced
by ligand binding or complex formation and those induced
simply by point mutation. However, large changes induced
by point mutations alone do occur. Analysis of the
structural change between 7ADH [38] and the designated
wild-type alcohol dehydrogenase structure 1ADG [39]
shows a displacement of 66 Å caused by structural
changes resulting from 22 mutations. Visualization of the
morphing shows chain breakages and backbone elements
passing through each other. That is, unlike TAQ DNA
polymerase, where the structural change represents an
observable physiological change, here there are discrete
and distinct states representing different structures.


4. Discussion and Conclusion 


The addition of a morph analysis to the PMR permits
the visualization of conformational change induced by
changes to the wild-type protein. However, care and a review
of the original PDB files are needed to distinguish
between conformational change induced by mutation (often
indicated by alternative folding, chain breakage, or the
passage of backbone and side chain atoms through one
another) and protein motions caused by ligand binding
rather than mutation, which appear as smooth transitions.
It is often necessary to refer back to the original PDB files
to understand the cause of conformational change. With
improvements to ligand descriptions within the PDB, we
anticipate better annotating and classifying specific motions
in the future.

Induced mutation while retaining structure shows a
different pattern of substitutions than that observed in nature
(Table 2 and Figures 4 and 5). Evolutionary sequence
drift usually preserves structure and function, whereas
structural biologists often mutate proteins specifically to
change structure and function, or indeed to improve on
nature by inducing better structure formation [40]. A useful addition
to the PMR would be mutations that prevented a
structure from being observed. Such negative data have not
traditionally found their way to the literature or to public
databases, but that will likely change with the advent of
structural genomics.

Our computation of morph movie representations en- 
ables the broader structural bioinformatics community to 
analyze and represent protein mutation conformational 
change visually. Computation of these biologically inter- 
esting results was made tractable by a number of new 
algorithms [29] and design decisions [23] within our 
software pipeline. These results and methods complement 
existing algorithms [20, 21] used to generate data for 
other areas of the PMR website. 